Abstract

In this paper we describe a family of parallel sorting algorithms
for a multiprocessor system. These algorithms are enumeration sortings and
comprise the following phases: (i) count acquisition: the keys are subdivided
into subsets and for each key we determine the number of smaller keys (count)
in every subset; (ii) rank determination: the rank of a key is the sum of the
previously obtained counts; (iii) data rearrangement: each key is placed in
the position specified by its rank. The basic novelty of the algorithms is
the use of parallel merging to implement count acquisition. By using
Valiant's merging scheme, we show that $n$ keys can be sorted in parallel
with $n \log_2 n$ processors in time $C \log_2 n + o(\log_2 n)$; in addition,
if memory fetch conflicts are not allowed, using a modified version of
Batcher's merging algorithm to implement phase (i), we show that $n$ keys can
be sorted with $n^{1+\alpha}$ processors in time
$(C'/\alpha) \log_2 n + o(\log_2 n)$, thereby matching the performance of
Hirschberg's algorithm, which, however, is not free of fetch conflicts.
NEW PARALLEL SORTING SCHEMES
F. P. Preparata


1. Introduction

The efficient implementation of comparison problems, such as merging,
sorting, and selection, by means of multiprocessor computing systems has
attracted considerable attention in recent years. One of the earliest
fundamental results is due to K. E. Batcher [1], who proposed a sorting
network consisting of comparators and based on the principle of iterated
merging; as is well known, such a scheme sorts $n$ keys with
$O(n(\log n)^2)$ comparators in time $O((\log n)^2)$. Batcher's network is
readily interpreted, in a more general framework, as a system of $n/2$
processors with access to a common data memory of $n$ cells: obviously, the
network structure induces a nonadaptive schedule of memory accesses. After
the appearance of Batcher's paper, substantial work was aimed at filling the
gap between the upper bound $O((\log n)^2)$ on the number of steps which is
achievable by a network of comparators and the lower bound $O(\log n)$; the
lack of success, however, convinced several workers to look for more
flexible forms of parallelism.

The first scheme shown to sort $n$ keys in time $O(\log n)$ is due to
D. E. Muller and F. P. Preparata [2], but it requires a discouraging number
of $O(n^2)$ processors. Subsequently, new results were obtained on parallel
merging by F. Gavril [3]. L. G. Valiant [4] must be credited with
addressing the fundamental question of the intrinsic parallelism of some


This work was supported in part by the National Science Foundation under
Grant MCS76-17321 and in part by the Joint Services Electronics Program
under Contract DAAB-07-72-C-0259.
processors, each capable of random-accessing a common memory with no alignment
penalty. Store, fetch, and arithmetic operations have unit costs, and fetch
conflicts are disallowed when appropriate.

All of the algorithms described in this paper, as well as
Hirschberg's [5], are instances of enumeration sorting, in Knuth's
terminology ([6], p. 73). In these methods each key is compared with all the
others and the number of smaller keys determines the given key's final
position. Specifically, three distinct tasks are clearly identifiable in
enumeration sorting algorithms:

(i)  count acquisition. The set of keys is partitioned into subsets
and for each key we determine the number of smaller keys in each
subset (this informal description momentarily assumes that all
keys are distinct);

(ii)  rank computation. For each key the sum of the counts obtained
in (i) gives the final position (rank) of that key in the
sorted sequence;

(iii)  data rearrangement. Each key is placed in its final position
according to its rank.

Less informally, an enumeration sorting scheme has the following format,
where we assume for simplicity that, for some given integer $r$, $n = kr$.
The data structures to be used are arrays of keys. By $A[i:j]$ we denote the
sequence $A[i] A[i+1] \ldots A[j]$.

Input: $A[0:n-1]$, the array of the keys to be sorted, integer $r$.

Output: $A[0:n-1]$, the array of the sorted keys.
1.  begin Define $A_i[0:r-1] \leftarrow A[ir:(i+1)r-1]$, for $i = 0, \ldots, k-1$.

2.  $c_j^{(i)}(\ell) \leftarrow \begin{cases} |\{A_j[h] : A_j[h] \le A_i[\ell]\}| & \text{for } j < i \\ |\{A_j[h] : A_j[h] < A_i[\ell]\}| & \text{for } j > i \end{cases}$

    $c_i^{(i)}(\ell) \leftarrow |\{A_i[h] : A_i[h] \le A_i[\ell],\ h < \ell\} \cup \{A_i[h] : A_i[h] < A_i[\ell],\ h > \ell\}|$

3.  $\mathrm{rank}(A_i[\ell]) \leftarrow \sum_{j=0}^{k-1} c_j^{(i)}(\ell)$

4.  $A[\mathrm{rank}(A_i[\ell])] \leftarrow A_i[\ell]$


Note that count acquisition, rank computation, and data rearrangement are
performed, respectively, in Steps 2, 3, and 4. Also, the algorithm must
ensure that all ranks are distinct, which is a crucial condition for the data
rearrangement task (otherwise memory store conflicts would occur). This
clearly poses no problem when the keys are all distinct. In the opposite
case, some convention must be adopted for the ordering of sets of identical
keys. One such convention is that sorting be stable (see [6], p. 4), that is,
the initial order of identical keys is preserved in the sorted array. Thus,
all of our sorting schemes will be stable. This is reflected in the rules
for the computation of the parameters in Step 2 of the above algorithm.
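The four-step format can be sketched sequentially in Python. This is only an illustration under stated assumptions: the function names `enumeration_ranks` and `enumeration_sort` are hypothetical, the loops run one after another rather than in parallel, and ties are resolved with the stable convention described above (equal keys in an earlier position count as smaller).

```python
def enumeration_ranks(keys, r):
    """Return the rank of every key, using subsets of size r (assumes n = k * r)."""
    n = len(keys)
    assert r > 0 and n % r == 0, "the format assumes n = k * r"
    k = n // r
    # Step 1: split A into k subsets A_0, ..., A_{k-1} of r keys each.
    subsets = [keys[i * r:(i + 1) * r] for i in range(k)]
    ranks = []
    for i in range(k):
        for l in range(r):
            key = subsets[i][l]
            rank = 0
            for j in range(k):
                # Step 2: count c_j^(i)(l) with the stable tie-breaking rules.
                if j < i:
                    count = sum(1 for x in subsets[j] if x <= key)
                elif j > i:
                    count = sum(1 for x in subsets[j] if x < key)
                else:
                    count = sum(1 for h, x in enumerate(subsets[i])
                                if (h < l and x <= key) or (h > l and x < key))
                # Step 3: the rank is the sum of the counts.
                rank += count
            ranks.append(rank)
    return ranks


def enumeration_sort(keys, r):
    """Step 4: place every key in the cell given by its (distinct) rank."""
    out = [None] * len(keys)
    for key, rank in zip(keys, enumeration_ranks(keys, r)):
        out[rank] = key
    return out
```

Because the ranks form a permutation of 0, ..., n-1 even when keys repeat, the final placement never writes two keys to the same cell.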

The simple algorithm proposed by Muller and Preparata in [2]
is a crude example of enumeration sorting, in which the sets $A_i$ are chosen
to be singletons. With this choice, each key is compared with every other
key, thereby using $O(n^2)$ processors; similarly, rank computation uses
$O(n^2)$ processors, since $O(n)$ processors are assigned to each key. The
time bound $O(\log n)$ is due to Step 3 (counting in parallel the number of
1's in a set of $n$ binary digits), whereas Steps 2 and 4 run in constant
time in our present model.
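A minimal sequential sketch of the singleton case follows. It is not the construction of [2] itself; the name `singleton_enumeration_sort` and the explicit round counter are assumptions made for illustration. Each row of comparison bits is summed by pairwise addition, and the number of rounds stands in for the parallel time of Step 3.

```python
def singleton_enumeration_sort(keys):
    """Enumeration sort with singleton subsets; returns (sorted keys, rounds)."""
    n = len(keys)
    # Step 2: one comparison bit per ordered pair (i, j). Ties are broken by
    # position so that all ranks are distinct (stable convention).
    rows = [[1 if (keys[j] < keys[i] or (keys[j] == keys[i] and j < i)) else 0
             for j in range(n)] for i in range(n)]
    # Step 3: add the bits of each row pairwise. Every pass of the loop
    # models one parallel time unit, so about ceil(log2 n) passes are needed.
    rounds = 0
    while any(len(row) > 1 for row in rows):
        rows = [[sum(row[p:p + 2]) for p in range(0, len(row), 2)]
                for row in rows]
        rounds += 1
    # Step 4: place each key in the cell given by its rank.
    out = [None] * n
    for i, row in enumerate(rows):
        out[row[0]] = keys[i]
    return out, rounds
```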
In the more complex procedures to be described later, the operations
of rank computation and data rearrangement are essentially carried out as
in the basic scheme described above. The main difference occurs with regard
to count acquisition. In the Muller-Preparata method the counts are acquired
by comparing each key with every other. The comparison of two keys $A[i]$ and
$A[j]$ could be viewed as merging $A[i]$ and $A[j]$. If rather than dealing
with single keys we now deal with sorted sequences of keys $A_i[0:r-1]$ and
$A_j[0:r-1]$, where $r > 1$ and, say, $j < i$, then the number of keys in
$A_j[0:r-1]$ which are no greater than $A_i[\ell]$ $(\ell = 0, \ldots, r-1)$,
as well as the number of keys in $A_i[0:r-1]$ which are less than $A_j[h]$
$(h = 0, \ldots, r-1)$, can be obtained by merging the two sequences
$A_j[0:r-1]$ and $A_i[0:r-1]$. In fact, let $B[0:2r-1]$ be the array obtained
by merging the two sorted arrays $A_j[0:r-1]$ and $A_i[0:r-1]$ with the
ordering convention $A_k[s] \le A_k[s+1]$ $(k = i, j)$ and
$B[s] \le B[s+1]$. Suppose also that the merging is stable, that is, the
order of identical keys in the concatenated array $A_j[0:r-1]A_i[0:r-1]$ is
preserved in $B[0:2r-1]$. If $B[q] = A_i[\ell]$, then there are $(q-\ell)$
entries of $A_j[0:r-1]$ in $B[0:q-1]$ which are no greater than $A_i[\ell]$;
similarly, if $B[q] = A_j[h]$, then there are $(q-h)$ entries of $A_i[0:r-1]$
in $B[0:q-1]$ which are strictly less than $A_j[h]$. This is the central idea
of the algorithms to be described.
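The position-minus-index rule can be checked with a short sequential sketch. The function name `counts_by_merging` is hypothetical, and an ordinary two-pointer merge stands in for the parallel merging schemes used later; the only property relied upon is stability with respect to the concatenation $A_j A_i$.

```python
def counts_by_merging(a_j, a_i):
    """a_j and a_i are sorted lists, with a_j playing the role of the
    earlier subset (j < i).

    Returns (count_in_j, count_in_i), where
      count_in_j[l] = number of keys of a_j that are <= a_i[l],
      count_in_i[h] = number of keys of a_i that are <  a_j[h].
    """
    # Stable merge of the concatenation a_j a_i: on ties the key of a_j
    # goes first. Each merged entry records (source, index in source).
    merged = []
    h = l = 0
    while h < len(a_j) or l < len(a_i):
        if l == len(a_i) or (h < len(a_j) and a_j[h] <= a_i[l]):
            merged.append(("j", h))
            h += 1
        else:
            merged.append(("i", l))
            l += 1
    count_in_j = [0] * len(a_i)
    count_in_i = [0] * len(a_j)
    for q, (source, pos) in enumerate(merged):
        # Of the q entries before position q, pos come from the key's own
        # sequence, so q - pos come from the other sequence.
        if source == "i":
            count_in_j[pos] = q - pos
        else:
            count_in_i[pos] = q - pos
    return count_in_j, count_in_i
```

For example, with `a_j = [1, 3, 3, 6]` and `a_i = [2, 3, 5, 7]`, the key 3 of `a_i` lands at position 4 of the merged array, and 4 - 1 = 3 keys of `a_j` are no greater than it.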

2. A fast parallel sorting algorithm

In this section we assume that in our computational model memory
fetch conflicts are permitted. To provide the feature required by Valiant's
merging algorithm, that a key be simultaneously compared with several other
keys, we may assume that the processors have broadcast capabilities. The
only overhead we shall neglect is the reassignment of processors to the
operation of merging pairs of subsequences, as occurs in Valiant's method [4].
Notice that this model of parallel computation coincides with that
required by Valiant's merging algorithm.

We assume inductively that the following algorithm, SORT1, for $p < n$
uses at most $\lfloor p \log p \rfloor$ processors to sort $p$ keys. Since
SORT1 is recursive, the following presentation constitutes a constructive
extension of the inductive step to the integer $n$. The induction can be
started with $n \ge 4$.

Algorithm SORT1

begin

1.  $k \leftarrow \lceil \log n \rceil$, $r \leftarrow \lfloor n/\lceil \log n \rceil \rfloor$

2.  Define arrays $S[0:k; 0:k; 0:2r-1]$ and $R[0:k; 0:k; 0:r-1]$
(three-dimensional arrays) and $A_i[0:r-1] \leftarrow A[ir:(i+1)r-1]$
$(i = 0, \ldots, k-1)$, $A_k[0:n-kr-1] \leftarrow A[kr:n-1]$ (for $n > kr$).

Comment: When $n = kr$, array $A_k$ is obviously vacuous. Array $S$ is
defined for simplicity as having $(k+1)^2 \cdot 2r$ cells, although the
algorithm will only make use of the cells $S[i;j;q]$ for which $i < j$.

3.  $A_i[0:r-1] \leftarrow \mathrm{SORT1}(A_i[0:r-1])$ $(i = 0, \ldots, k-1)$
$A_k[0:n-kr-1] \leftarrow \mathrm{SORT1}(A_k[0:n-kr-1])$.

Comment: This step is a parallel recursive call of SORT1 and it
involves sorting in parallel $k$ sets of $r$ keys each and, possibly,
one set of $(n-kr)$ keys. By the inductive hypothesis, it uses at
most $k \lfloor r \log r \rfloor + \lfloor (n-kr) \log(n-kr) \rfloor = N$
processors. Since $n - kr < \lceil \log n \rceil$, the number of processors
used is less than

$\lceil \log n \rceil \lfloor \lfloor n/\lceil \log n \rceil \rfloor \cdot \log \lfloor n/\lceil \log n \rceil \rfloor \rfloor + \lfloor \lceil \log n \rceil \log \lceil \log n \rceil \rfloor$

$< n \log(n/\lceil \log n \rceil) + \lceil \log n \rceil \log \lceil \log n \rceil$

$= n \log n - \log \lceil \log n \rceil \, (n - \lceil \log n \rceil) \le n \log n - 1$

$< \lfloor n \log n \rfloor$, for $n \ge 3$. For the sake of uniformity, array
$A_k$ is now extended to size $r$, where each cell of $A_k[n-kr:r-1]$ is
filled with a dummy sentinel larger than any key.
4.  $S[i;j;0:r-1] \leftarrow A_i[0:r-1]$ $(i = 0, \ldots, k-1;\ j = i+1, \ldots, k)$
$S[i;j;r:2r-1] \leftarrow A_j[0:r-1]$ $(i = 0, \ldots, j-1;\ j = 1, \ldots, k)$

Comment: This is a copying operation whose objective is to obtain
$S[i;j;0:2r-1] = A_i[0:r-1]A_j[0:r-1]$ for all pairs $(i,j)$ with $i < j$.
In our model, this operation could be done with maximal parallelism.
However, using only $\binom{k+1}{2} r$ processors, the elementary
copying operations are completed in two time units. For later
convenience we assume that the record associated with key $A_i[\ell]$
contains a LABEL consisting of the pair of integers $(i,\ell)$.

5.  $S[i;j;0:2r-1] \leftarrow \mathrm{MERGE}(S[i;j;0:r-1], S[i;j;r:2r-1])$
$(i = 0, \ldots, k-1;\ j = i+1, \ldots, k)$

Comment: This step uses Valiant's merging algorithm and runs in time
$C_1 \log\log r + O(1)$, for some constant $C_1$, using $\binom{k+1}{2} r$
processors. The original version of Valiant's merging algorithm can be
readily modified, so that, whenever two keys are identical, the indices of
their respective subarrays are compared.

6.  Let $(x,\ell) \leftarrow \mathrm{LABEL}(S[i;j;q])$
if $x = i$ then $R[i;j;\ell] \leftarrow q - \ell$ else $R[j;i;\ell] \leftarrow q - \ell$
$(i = 0, \ldots, k-1;\ j = i+1, \ldots, k;\ q = 0, \ldots, 2r-1)$

7.  $R[i;i;\ell] \leftarrow \ell$ $(i = 0, \ldots, k;\ \ell = 0, \ldots, r-1)$

Comment: Steps 6 and 7 complete the count acquisition task. In
fact, after Step 7 the content of $R[i;j;\ell]$ is $c_j^{(i)}(\ell)$, in the
terminology of Section 1. Step 6 can be executed in two time
units using $\binom{k+1}{2} r$ processors, whereas Step 7 uses $(k+1)r$
processors and runs in one time unit.

8.  $\mathrm{rank}(A_i[\ell]) \leftarrow \sum_{j=0}^{k} R[i;j;\ell]$ $(i = 0, \ldots, k;\ \ell = 0, \ldots, r-1)$

Comment: This step implements the rank computation. For any
pair $(i,\ell)$ the sum can be computed with $\lfloor (k+1)/2 \rfloor$
processors in time $\lceil \log(k+1) \rceil \approx \log\log n$. The total
number of processors used is therefore $n \lfloor (k+1)/2 \rfloor$.

9.  $A[\mathrm{rank}(A_i[\ell])] \leftarrow A_i[\ell]$ $(i = 0, \ldots, k;\ \ell = 0, \ldots, r-1)$

end
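The control flow of SORT1 can be traced with a sequential Python sketch. It is an illustration only, under several assumptions not found in the text: the name `sort1` is reused for a plain function, all parallel steps are executed one after another, an ordinary stable two-pointer merge replaces Valiant's merging algorithm, inputs with fewer than four keys are handed to the built-in sort as a base case, and the last block is left short instead of being padded with sentinels.

```python
def sort1(keys):
    """Sequential simulation of the structure of SORT1 (not its parallel cost)."""
    n = len(keys)
    if n < 4:                       # assumed base case for tiny inputs
        return sorted(keys)
    k = (n - 1).bit_length()        # Step 1: k = ceil(log2 n)
    r = n // k                      #         r = floor(n / k)
    # Steps 2-3: k blocks of r keys plus one leftover block, sorted recursively.
    blocks = [sort1(keys[i * r:(i + 1) * r]) for i in range(k)]
    blocks.append(sort1(keys[k * r:]))
    # R[i][j][l] will hold the count c_j^(i)(l).
    R = [[[0] * len(blocks[i]) for _ in range(k + 1)] for i in range(k + 1)]
    for i in range(k + 1):
        for j in range(i + 1, k + 1):
            # Steps 4-5: stable merge of A_i A_j, carrying LABEL = (block, index).
            merged, a, b = [], 0, 0
            while a < len(blocks[i]) or b < len(blocks[j]):
                if b == len(blocks[j]) or (
                        a < len(blocks[i]) and blocks[i][a] <= blocks[j][b]):
                    merged.append((i, a))
                    a += 1
                else:
                    merged.append((j, b))
                    b += 1
            # Step 6: merged position minus own index gives the number of
            # keys of the other block that precede this key.
            for q, (x, l) in enumerate(merged):
                if x == i:
                    R[i][j][l] = q - l
                else:
                    R[j][i][l] = q - l
        # Step 7: inside its own sorted block, key l has l predecessors.
        for l in range(len(blocks[i])):
            R[i][i][l] = l
    # Steps 8-9: sum the counts and place every key at its rank.
    out = [None] * n
    for i in range(k + 1):
        for l in range(len(blocks[i])):
            out[sum(R[i][j][l] for j in range(k + 1))] = blocks[i][l]
    return out
```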

To complete the analysis of the algorithm, we observe that none
of Steps 4-7 uses more than $\binom{k+1}{2} r$ processors. But

$\binom{k+1}{2} r = \lfloor n/\lceil \log n \rceil \rfloor \cdot \lceil \log n \rceil (\lceil \log n \rceil + 1)/2 \le n(\lceil \log n \rceil + 1)/2$

where the last inequality is due to the removal of the "floor" sign.
Also, Step 8 uses $n \lfloor (k+1)/2 \rfloor \le n(\lceil \log n \rceil + 1)/2$
processors. Since, for all $n \ge 4$,
$n(\lceil \log n \rceil + 1)/2 < \lfloor n \log n \rfloor$, the inductive
hypothesis on the number of processors is extended.
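The processor bounds of this paragraph can be spot-checked numerically. The helper below is a hypothetical illustration (the name `sort1_processor_counts` does not come from the text); it assumes base-2 logarithms and simply evaluates the three quantities being compared.

```python
import math


def sort1_processor_counts(n):
    """Return (merge_procs, rank_procs, budget) for SORT1 on n >= 4 keys."""
    k = (n - 1).bit_length()                 # ceil(log2 n)
    r = n // k                               # floor(n / k)
    merge_procs = ((k + 1) * k // 2) * r     # binom(k+1, 2) * r, Steps 4-7
    rank_procs = n * ((k + 1) // 2)          # n * floor((k+1)/2), Step 8
    budget = math.floor(n * math.log2(n))    # floor(n log n)
    return merge_procs, rank_procs, budget
```

For n = 4 this gives 6 and 4 processors against a budget of 8.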

Finally, let $T(n)$ denote the running time of the algorithm for
$n$ keys. Since $r \approx n/\log n$ we obtain

$T(n) = T(n/\log n) + C_2 \log\log n + C_3$

for some constants $C_2$ and $C_3$. It is easily verified that a function of
the form $C_2 \log n + o(\log n)$ is a solution of the above recurrence. It
is worth noting that for the same number of processors, Valiant proposes a
sorting scheme of the merge-sort type ([4], Corollary 8) which runs in time
$2 \log n \cdot \log\log n - o(\log n \cdot \log\log n)$.
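One informal way to see why the recurrence has a solution of this form is to unroll it, ignoring floors and ceilings. The sequence $n_i$ and the depth $d$ below are notation introduced only for this sketch.

```latex
\begin{aligned}
n_0 &= n, \qquad n_{i+1} = n_i/\log n_i
   \;\Longrightarrow\; \log n_{i+1} = \log n_i - \log\log n_i, \\
T(n) &= \sum_{i=0}^{d-1} \bigl(C_2 \log\log n_i + C_3\bigr) + O(1)
      = C_2 \sum_{i=0}^{d-1} \bigl(\log n_i - \log n_{i+1}\bigr) + C_3 d + O(1) \\
     &= C_2 \bigl(\log n - \log n_d\bigr) + C_3 d + O(1).
\end{aligned}
```

Here $d$ is the number of levels needed to reach a problem of constant size. Each level lowers $\log n_i$ by $\log\log n_i$, so $d = O(\log n/\log\log n) = o(\log n)$, and the sum telescopes to $C_2 \log n + o(\log n)$.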

3. Parallel sorting algorithms with no memory fetch conflicts

We shall now consider a family of algorithms for sorting $n$ numbers
in parallel with $n^{1+\alpha}$ processors $(0 < \alpha < 1)$ in time
$(C'/\alpha) \log n + o(\log n)$, for some constant $C'$. Each of these
algorithms has the same performance as the corresponding algorithm by
Hirschberg [5], although no memory fetch conflict occurs in this case.
Again, we make the inductive hypothesis that for $p < n$, Algorithm SORT2
uses $p^{1+\alpha}$ processors to sort $p$ keys. The format of SORT2 closely
parallels that of SORT1, with a few crucial differences to be noted.

Algorithm SORT2

begin

1.  $k \leftarrow \lceil n^\alpha \rceil$, $r \leftarrow \lfloor n/\lceil n^\alpha \rceil \rfloor$

2.  Define arrays $S[0:k; 0:k; 0:2r-1]$, $R[0:k; 0:k; 0:r-1]$
and $A_i[0:r-1] \leftarrow A[ir:(i+1)r-1]$ $(i = 0, \ldots, k-1)$,
$A_k[0:n-kr-1] \leftarrow A[kr:n-1]$ for $n > kr$.

3.  $A_i[0:r-1] \leftarrow \mathrm{SORT2}(A_i[0:r-1])$ $(i = 0, \ldots, k-1)$,
$A_k[0:n-kr-1] \leftarrow \mathrm{SORT2}(A_k[0:n-kr-1])$

Comment: This parallel recursive call of SORT2 sorts $k$ sets of
$r$ keys each and, possibly, one set of $n - kr < k$ keys. By the
inductive hypothesis, at most $k r^{1+\alpha} + (n-kr)^{1+\alpha} = N$
processors are used. Since $n - kr < k$, then
$N < k r^{1+\alpha} + (n-kr) \cdot k^\alpha = kr(r^\alpha - k^\alpha) + n \cdot k^\alpha$.
Also $kr = \lceil n^\alpha \rceil \cdot \lfloor n/\lceil n^\alpha \rceil \rfloor \le n$,
whence $N < n(r^\alpha - k^\alpha + k^\alpha) \approx n \cdot n^{(1-\alpha)\alpha}
= n^{1+\alpha-\alpha^2} < n^{1+\alpha}$, where we have used the approximation
$r \approx n^{1-\alpha}$.
Steps 1-3 are analogous to the corresponding ones in SORT1; however,
the copying operation implemented by Step 4 of SORT1 must be
considerably modified, as shown by the following Steps 4-6, to
avoid fetch conflicts. Here again, $A_k$ is extended to size $r$
as in SORT1.

4.  $S[i;k;0:r-1] \leftarrow A_i[0:r-1]$ $(i = 0, \ldots, k-1)$
$S[0;j;r:2r-1] \leftarrow A_j[0:r-1]$ $(j = 1, \ldots, k)$

5.  for $m \leftarrow 0$ step 1 until $\lceil \log(k+1) \rceil - 2$ do
$S[i;j-2^m;0:r-1] \leftarrow S[i;j;0:r-1]$
$(j = k-2^m+1, \ldots, k;\ i = 0, \ldots, j-2^m-1)$
$S[i+2^m;j;r:2r-1] \leftarrow S[i;j;r:2r-1]$
$(i = 0, \ldots, 2^m-1;\ j = i+2^m+1, \ldots, k)$

6.  Let $\lceil \log(k+1) \rceil - 1 = v$.
$S[i;j-2^v;0:r-1] \leftarrow S[i;j;0:r-1]$
$(j = 2^v+1, \ldots, k;\ i = 0, \ldots, j-2^v-1)$
$S[i+2^v;j;r:2r-1] \leftarrow S[i;j;r:2r-1]$
$(i = 0, \ldots, k-2^v-1;\ j = i+2^v+1, \ldots, k)$

Comment: Steps 4-6 jointly replicate each $A_i[0:r-1]$ the required number
$k$ of times. Step 4 is an initial copy; Step 5 consists of
$(\lceil \log(k+1) \rceil - 1)$ stages, each of which doubles the ranges of
the indices; Step 6 accounts for the fact that $k$ may not be a power of 2
and completes filling the array $S$. Clearly this copying operation is
implemented in $\lceil \log(k+1) \rceil + 1 \approx \alpha \log n + 1$ time
units. A straightforward analysis shows that the largest number of
processors used in any of these stages is at most 5/16 of the total number
$\binom{k+1}{2} 2r$ of cells of $S$ to be filled. It is also easily shown
that $(5/16) \binom{k+1}{2} 2r \approx (5/16)(n^\alpha + 1) n^\alpha \cdot n^{1-\alpha} < n^{1+\alpha}$
for any $n \ge 1$ and $\alpha > 0$.

7.  $S[i;j;0:2r-1] \leftarrow \mathrm{MERGE}(S[i;j;0:r-1], S[i;j;r:2r-1])$
$(i = 0, \ldots, k-1;\ j = i+1, \ldots, k)$.

Comment: This step uses a stable version of Batcher's merging algorithm [1],
which is easily obtained by requiring that whenever two identical keys are
encountered their array indices be compared (see Appendix). The
following facts about Batcher's merging algorithm are well known:
(i) no fetch conflict occurs because at any stage (or, time unit)
each key is compared with exactly one key; (ii)
$\binom{k+1}{2} r \approx [(n^\alpha + 1) n^\alpha/2] \cdot n^{1-\alpha} < n^{1+\alpha}$
processors are used; (iii) merging is completed in
$\log r \approx (1-\alpha) \log n$ time units.

8.  Steps 8, 9, 10, and 11 of this algorithm are respectively
identical to Steps 6, 7, 8, and 9 of SORT1 and are therefore
omitted. The latter are clearly free of memory fetch conflicts. The
analysis of SORT1 showed that at most
$\max\bigl(\binom{k+1}{2} r,\ n \lfloor (k+1)/2 \rfloor\bigr)$
processors were used in any of those steps. In the present case, we
have already shown that $\binom{k+1}{2} r < n^{1+\alpha}$; similarly we
conclude

$n \lfloor (k+1)/2 \rfloor < n(n^\alpha + 1)/2 < n^{1+\alpha}$.
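The index pattern of the copying Steps 4-6 of SORT2 can be simulated at the level of whole blocks. The sketch below is an assumption-laden illustration: the function name `replicate_without_fetch_conflicts` is hypothetical, each half of $S[i;j;\cdot]$ is treated as a single unit, and the reads of the arrays $A_i$ in Step 4 are not modelled. It checks that within every stage each already-filled half-cell is read by at most one copy operation, and that all pairs $i < j$ are filled at the end.

```python
def replicate_without_fetch_conflicts(k):
    """Simulate Steps 4-6 for blocks 0..k (k >= 1); return the stage count."""
    first = set()    # pairs (i, j) whose first half already holds A_i
    second = set()   # pairs (i, j) whose second half already holds A_j
    stages = 0

    def run_stage(moves_first, moves_second):
        nonlocal stages
        for filled, moves in ((first, moves_first), (second, moves_second)):
            sources = [src for src, _ in moves]
            # No cell may be fetched twice within one time unit.
            assert len(sources) == len(set(sources)), "fetch conflict"
            assert all(src in filled for src in sources), "unfilled source"
            filled.update(dst for _, dst in moves)
        stages += 1

    # Step 4: initial copy into column k (first halves) and row 0 (second halves).
    first.update((i, k) for i in range(k))
    second.update((0, j) for j in range(1, k + 1))
    stages += 1
    v = k.bit_length() - 1           # v = ceil(log2(k + 1)) - 1
    # Step 5: each stage doubles the range of filled columns and rows.
    for m in range(v):
        step = 2 ** m
        run_stage(
            [((i, j), (i, j - step))
             for j in range(k - step + 1, k + 1) for i in range(j - step)],
            [((i, j), (i + step, j))
             for i in range(step) for j in range(i + step + 1, k + 1)],
        )
    # Step 6: final stage, needed when k is not a power of 2.
    step = 2 ** v
    run_stage(
        [((i, j), (i, j - step))
         for j in range(step + 1, k + 1) for i in range(j - step)],
        [((i, j), (i + step, j))
         for i in range(k - step) for j in range(i + step + 1, k + 1)],
    )
    wanted = {(i, j) for i in range(k + 1) for j in range(i + 1, k + 1)}
    assert first == wanted and second == wanted
    return stages
```

The returned stage count equals $\lceil \log(k+1) \rceil + 1$, matching the time bound stated in the comment to Steps 4-6.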


From the performance viewpoint, all steps of the algorithm require
at most $n^{1+\alpha}$ processors, as postulated. This extends the inductive
hypothesis on the number of processors used by the algorithm. As to the
running time $T(n)$, we note the following: Steps 4-6 jointly require
$\alpha \log n + 1$ time units; Step 7 requires $(1-\alpha) \log n$ time
units; Step 10 requires $\alpha \log n$ time units; Steps 8, 9, and 11 run in
constant time. Since Step 3 is a recursive call of SORT2 on sets of
$r \approx n^{1-\alpha}$ elements, we obtain for $T(n)$ the recurrence
equation

$T(n) = T(n^{1-\alpha}) + (C'_1 \alpha + C'_2) \log n + C'_3$

for some constants $C'_1$, $C'_2$, and $C'_3$. It is easily verified that a
function of the form $[(C'_1 \alpha + C'_2)/\alpha] \log n + o(\log n)$ is a
solution of this equation, whence
$T(n) \le (C'/\alpha) \log n + o(\log n)$.
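The $1/\alpha$ factor can be seen by unrolling the recurrence informally, ignoring floors and ceilings. In this sketch $c$ denotes the coefficient of $\log n$ in the recurrence, $b$ its additive constant, and $d$ the recursion depth; all three are notation introduced here.

```latex
\begin{aligned}
n_i &= n^{(1-\alpha)^i} \;\Longrightarrow\; \log n_i = (1-\alpha)^i \log n, \\
T(n) &= \sum_{i=0}^{d-1} \bigl(c\,(1-\alpha)^i \log n + b\bigr) + O(1)
      = c \log n \cdot \frac{1 - (1-\alpha)^d}{\alpha} + b\,d + O(1) \\
     &\le \frac{c}{\alpha} \log n + b\,d + O(1).
\end{aligned}
```

The recursion stops when $(1-\alpha)^d \log n$ drops to a constant, so for fixed $\alpha$ the depth is $d = O(\log\log n) = o(\log n)$, which gives the stated form.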
Appendix

A stable version of Batcher's merging algorithm.

The original version of Batcher's odd-even merging algorithm runs
as follows (here, for simplicity, we assume that the common length of the
sequences to be merged is a power of 2):

MERGE($A[0:2^{k-1}-1]$, $A[2^{k-1}:2^k-1]$)

1.  $A'[j] \leftarrow A[2j]$, $A'[2^{k-1}+j] \leftarrow A[2j+1]$ $(j = 0, 1, \ldots, 2^{k-1}-1)$

2.  $B[0:2^{k-1}-1] \leftarrow \mathrm{MERGE}(A'[0:2^{k-2}-1], A'[2^{k-2}:2^{k-1}-1])$
$B[2^{k-1}:2^k-1] \leftarrow \mathrm{MERGE}(A'[2^{k-1}:3 \cdot 2^{k-2}-1], A'[3 \cdot 2^{k-2}:2^k-1])$

3.  $A[2j-1] \leftarrow \min(B[j], B[2^{k-1}+j-1])$, $A[2j] \leftarrow \max(B[j], B[2^{k-1}+j-1])$
$(j = 1, \ldots, 2^{k-1}-1)$

This algorithm is not stable (see [6], p. 135, exercise 13), because in
compliance with the rules of the comparator module, whenever
$B[j] = B[2^{k-1}+j-1]$ in Step 3, the algorithm assigns
$A[2j-1] \leftarrow B[j]$, $A[2j] \leftarrow B[2^{k-1}+j-1]$. Fortunately,
however, with a simple modification, stability can be attained.
Specifically, we associate with each key of the initial array $A[0:2^k-1]$ a
label, and set $\mathrm{LABEL}(A[j]) \leftarrow j$ (notice that all
labels are distinct). We then replace Step 3 above with the following step:

3'.  if $B[j] = B[2^{k-1}+j-1]$ then $A[2j-1] \leftarrow$ key with smaller label,
$A[2j] \leftarrow$ key with larger label
else $A[2j-1] \leftarrow \min(B[j], B[2^{k-1}+j-1])$, $A[2j] \leftarrow \max(B[j], B[2^{k-1}+j-1])$
$(j = 1, \ldots, 2^{k-1}-1)$.

We now prove that the new version of the algorithm is stable.
Assume that in the original array $A[0:2^{k-1}-1]$ the subarrays
$A[0:t_1-1]$, $A[t_1:s_1-1]$, and $A[s_1:2^{k-1}-1]$ contain keys which are
respectively less, equal, and larger than some fixed value $a$; similarly
for $A[2^{k-1}:2^{k-1}+t_2-1]$, $A[2^{k-1}+t_2:2^{k-1}+s_2-1]$, and
$A[2^{k-1}+s_2:2^k-1]$. Assume inductively that the merged sequences obtained
in Step 2 are stably sorted and consider a key $A[p]$ for
$p \in [t_1, s_1-1]$. Assume at first $p = 2j$; then, by Step 1,
$A'[j] = A[2j]$. Moreover, there are $\lceil t_2/2 \rceil$ keys in
$A'[2^{k-2}:2^{k-1}-1]$ strictly smaller than $A[2j]$; whence
$A[2j] = B[j + \lceil t_2/2 \rceil]$. According to Step 3',
$B[j + \lceil t_2/2 \rceil]$ is compared with
$B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil]$. Suppose $t_2$ is even:
then if $(2j-1) \in [t_1, s_1-1]$,
$B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil] = A[2j-1]$ and we compare
(the labels of) $A[2j]$ and $A[2j-1]$; otherwise, i.e., when
$2j-1 = t_1-1$ or, equivalently, $A[2j] = A[t_1]$, the latter is compared
with a key less than $a$. Suppose now that $t_2$ is odd: then if
$(2j+1) \in [t_1, s_1-1]$, we have
$B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil] = A[2j+1]$ and we compare (the
labels of) $A[2j]$ and $A[2j+1]$; otherwise $A[2j] = A[s_1-1]$ and
$A[s_1-1]$ is compared with a key which is either larger or has a larger
label. Clearly, in the given case, stability is ensured, and by analogous
arguments we can treat all other cases.
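The modified merge can be sketched in Python as follows. This is an illustration under assumptions rather than the appendix's exact procedure: keys are carried as (key, label) pairs, the names `compare_exchange` and `stable_odd_even_merge` are hypothetical, the recursion bottoms out at two elements with a single labelled compare-exchange, and the first and last output cells are copied directly from the merged halves, which Step 3 leaves implicit. The code runs sequentially; in the parallel setting each loop iteration would be one comparator.

```python
def compare_exchange(x, y):
    """Step 3': order two (key, label) pairs, using labels only on equal keys."""
    (key_x, label_x), (key_y, label_y) = x, y
    if key_x == key_y:
        return (x, y) if label_x < label_y else (y, x)
    return (x, y) if key_x < key_y else (y, x)


def stable_odd_even_merge(a):
    """Merge the two sorted halves of a, a list of (key, label) pairs.

    len(a) must be a power of 2 (at least 2), and labels must equal the
    positions of the records in the initial array.
    """
    n = len(a)
    if n == 2:
        return list(compare_exchange(a[0], a[1]))
    half = n // 2
    # Step 1 separates even-indexed and odd-indexed entries; Step 2 merges
    # each group recursively (each group again consists of two sorted halves).
    b = stable_odd_even_merge(a[0::2]) + stable_odd_even_merge(a[1::2])
    out = [None] * n
    out[0], out[n - 1] = b[0], b[n - 1]
    # Step 3': one labelled comparator per pair (B[j], B[half + j - 1]).
    for j in range(1, half):
        out[2 * j - 1], out[2 * j] = compare_exchange(b[j], b[half + j - 1])
    return out
```

For instance, merging the halves `[1, 3, 3, 5]` and `[2, 3, 4, 5]` with labels 0 to 7 keeps the three keys equal to 3 in label order 1, 2, 5.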


